March Madness Bracket

Why?

Well, why indeed? I don't even like basketball and I don't follow college sports at all (I am an avid NFL fan though). One of my friends from UT Austin (so way back in my physics days) posted on Facebook about something called "Coder's Bracket" (which no longer exists so I haven't worked on this since 2016). You had to create a javascript routine to predict the winner of every match. The site provided methods to get team stats for each of the 64 teams (i.e. team.getRank()). So for instance, one simple routine, is you could choose a winner based on the ranking (which, obviously, actually works pretty well). I joked on his post that if anyone has all of the stats for previous brackets, I could use machine learning to create an algorithm...I still had a couple of days until the tournament started so I ended up doing just that.

Did it Work?

I would consider the results mixed but positive. I played this "Coder's Bracket" challenge from 2014-2016 (three years) and then they no longer maintained it so I haven't messed with this since (I really should). My first attempt wasn't very successful and I think it ended up somewhere in the 50th percentile. Then again, my first year, I threw this whole thing together in about 2 days. Since you could submit more brackets, even after the tournament started, I did and those did much better (obviously you couldn't enter these into the contest, you just got an idea of your score).

I was actually very successful in 2015. I think I submitted three brackets and all three were above the 80th percentile and I even had one bracket that was in the top 20 overall (I think among something like 10,000 entries) until Kentucky blew everyone's bracket when they lost in the Final Four to Wisconsin. Overall I pretty consistently was in the 80-90th percentile.

Stats

So this was probably the biggest challenge. Where can I find the following stats for every team in the tournament going back as far as I can.

Seed (1-16) Rank (Overall) Wins Losses
Games Played Winning Percentage FG Made FG Attempted
FG Percentage FT Made FT Attempts FT Percentage
3-pt Made 3-pt Attempted 3-pt Percentage Total Rebounds
Offensive Rebounds Defensive Rebounds Rebounds Per Game Steals
Steals Per Game Blocks Blocks Per Game Assists
Assists Per Game Turnovers Turnovers Per Game Round (in tournament)

This data is available from many different websites and can pretty easily be found. For instance, let's say I want the stats for the 2015 Kentucky Wildcats (just choose a website). So now I just repeat, 64 times for each team—and for each year going back as far as I'm willing to.

Scrape HTML To Automate Data Gathering

I ended up using Sports Reference because 1) it had all of the information I needed easily accessible from a single webpage and 2) everything was written in pure HTML (no server calls to dynamically display data) so the data I needed was embedded in the HTML source.

I used jsoup to parse the HTML. This was fairly tedious work since I had to manually figure out how to find the data I needed (usually could find it by an id or class). However this was only necessary for a few things:

  • Find the links to each year's tournament data
  • Find the results for each game (and what round it was)
  • Find links to team stats (from results)
  • Parse team stats page for the data I needed

Here is a code snippet which gives you some idea of how I accomplished this. Reading the brackets themselves are more complex, but the idea is the same (there's just a lot more to look for).

final Document source = Jsoup.parse(new URL(link), 2000);

// select all elements which have 'bracket' set as their class attribute
final Elements bracketBody = source.select(".bracket");

for (Element bracket : bracketBody) {

	// check to see if first child is a tbody element...if so, set
	// bracket to that and proceed
	if (bracket.child(0).tagName().equals("tbody")) {
		bracket = bracket.child(0);
	}

	// check the first tr's first td element...if it has class
	// 'align_right seed', then this is a preliminary bracket, otherwise
	// it's the final four.
	if (bracket.child(0).child(0).className()
			.equals("align_right seed"))
		readPreliminaryBracket(bracket);
	else
		readFinalFour(bracket);

The first year I did this (2014), Sports Reference only had digital data going back to 1998 (prior to that they only had photo-copied results as images). They were clearly in the process of digitizing their data because the next year and the following (2015 then 2016) they had more digital data. Unfortunately, they had also updated their website. So every year I did this (2014-2016) I had to rewrite the HTML scraping code. Luckily though, it was a one and done thing: once it was written, I had all of the data I needed for that year.

Machine Learning

I used Weka to process my data because I had previously used Weka in my Machine Learning class. The first thing I did was that I took the data I gathered above and put it into an ARFF file. I was then able to manually use Weka to process data for me.

Classification Problem

I saw this as a classification problem. Each data set represents a game: two sets of team data. The task is to classify the data set as either "team 1" or "team 2" (the winner of the game). There is immediately going to be a problem with my training set if I don't take precautions. Theoretically, this determination is symmetric: meaning if I switch which is team1 and which is team2, the algorithm should also switch its answer. If the data is organized (when I gather it) so that I end up putting all of the winners as team1, then training set will just recognize that team1 always wins (or almost always wins) and predict that. Instead, before I got to Weka, I randomized the ARFF files so that this is unlikely to happen.

Raw Data

Initially I just put all of the data I had into the ARFF as is. I got data that looked something like this:


This data is somewhat difficult to analyze just looking at it. Look at the data for seed1 and seed2: they are basically mirror images of each other. If seed1 is high (e.g. seed 1 is the highest seed), then team1 (blue) is likely to win and vice versa, if seed2 is high team2 (red) is likely to win.

Difference Data

Thinking about the above data, it's not giving much of a direct comparison. If you think rank is the most important stat, then you would expect something like this:

if(rank1 > rank2)
	team1.winsGame();
else
	team2.winsGame();

But this doesn't really match how the the construction of the J48 tree works. Instead it's going to create a complicated case structure where it finds values for things like seed1 and seed2 to split on. With this in mind, I decided it might be beneficial to combine each of the two stats into a difference, rather than two absolute values. Here is what that data looks like:


I find this data much easier to make sense of than the raw data. This is showing which data is actually predicting the winner. You can tell when the red and blue areas are separated, each to one side of 0 (i.e. the two values are equal). Take the following two: the difference in rank (left) and the difference in free throw percentage (right):


You can clearly see the red skews to the right and the blue to the left for the rank difference (suggesting a negative difference means team1 is more likely to win and positive difference, team2 more likely). However for the free throw percentage there isn't an apparent trend (they're both pretty well centered about a difference of 0). The free throw percentage isn't very predictive but if you look at the data above, the number of free throw attempts (and thus also free throws made) does show a predictive trend. This is because free throw percentages are generally high regardless and thus aren't going to be very predictive whereas the number of free throws attempted shows a team that draws a lot of fouls.

Classification

Caveat: the last time I used this was for 2016. So it's been three years since I've gone through this. Thus a lot of this section is "to the best of my recollection".

I had the most success using a J48 tree. I used some others, like JRip, ZeroR, and Bayes algorithms (BayesNet and BayesNaive) but my best results from the J48 tree were consistently about 5 percentage points more accurate than the best for all of the others.

The percentage correct was pretty similar (about 80%) for the different J48 parameters so I found that looking at the ROC Area and Kappa-statistic to be the best indicator of a good run. There were three parameters that I varied when evaluating the J48 tree:

  • Confidence Factor - Tested between 0.01 and 0.15.
  • numFolds (amount of data to use for pruning) - Tested between 1 and 4.
  • minNumObj (minimum number of instance per leaf) - Test between 10 and 15.

I probably tested more than those ranges, but those are the ones I settled on. As I recall, I spent quite a bit of time experimenting with those numbers to try and narrow in on a range that gave me good results.

Converting to Javascript

Since I was using a website to generate my bracket from Javascript code (given their data and methods), I had to convert the tree generated by Weka to a javascript routine. I actually never generated a bracket myself so unfortunately, I don't have a bracket to show.

Once I settled on parameters for the J48 tree, the output from Weka looked something like this:

fgMadeDiff <= 1
|   seedDiff <= 3
|   |   tovPerGameDiff <= -2.713964: team1 (23.77/7.02)
|   |   tovPerGameDiff > -2.713964
|   |   |   ftAttemptsDiff <= -136: team2 (41.13/5.0)
|   |   |   ftAttemptsDiff > -136
|   |   |   |   fgAttemptsDiff <= -105: team2 (116.47/30.48)
|   |   |   |   fgAttemptsDiff > -105
|   |   |   |   |   rebPerGameDiff <= -1.4613: team1 (23.39/2.91)
|   |   |   |   |   rebPerGameDiff > -1.4613
|   |   |   |   |   |   drbDiff <= -38: team2 (19.21/6.09)
|   |   |   |   |   |   drbDiff > -38: team1 (48.04/21.82)
|   seedDiff > 3: team2 (283.0/34.0)
fgMadeDiff > 1
|   seedDiff <= -8: team1 (177.0/8.0)
|   seedDiff > -8
|   |   seedDiff <= 8
|   |   |   fgAttemptsDiff <= 225
|   |   |   |   seedDiff <= 4
|   |   |   |   |   stlDiff <= 66
|   |   |   |   |   |   rebPerGameDiff <= -2.385714: team1 (29.0/1.0)
|   |   |   |   |   |   rebPerGameDiff > -2.385714
|   |   |   |   |   |   |   totRebDiff <= -75: team2 (7.0)
|   |   |   |   |   |   |   totRebDiff > -75: team1 (114.0/39.0)
|   |   |   |   |   stlDiff > 66: team2 (38.0/14.0)
|   |   |   |   seedDiff > 4
|   |   |   |   |   astPerGameDiff <= 1.95
|   |   |   |   |   |   rebPerGameDiff <= 0.970656: team1 (23.0/8.0)
|   |   |   |   |   |   rebPerGameDiff > 0.970656: team2 (10.0/1.0)
|   |   |   |   |   astPerGameDiff > 1.95: team2 (11.0/1.0)
|   |   |   fgAttemptsDiff > 225: team1 (144.0/24.0)
|   |   seedDiff > 8: team2 (26.0/5.0)

This is just a giant if-else tree, so I just needed to correctly parse it to translate to javascript (this was automated and above this are the definitions for all of the values):

if(fgMadeDiff <= 1){
	if(seedDiff <= 3){
		if(tovPerGameDiff <= -2.713964){
			team1.winsGame();
		} else{
			if(ftAttemptsDiff <= -136){
				team2.winsGame();
			} else{
				if(fgAttemptsDiff <= -105){
					team2.winsGame();
				} else{
					if(rebPerGameDiff <= -1.4613){
						team1.winsGame();
					} else{
						if(drbDiff <= -38){
							team2.winsGame();
						} else{
							team1.winsGame();
						}
					}
				}
			}
		}
	} else{
		team2.winsGame();
	}
} else{
	if(seedDiff <= -8){
		team1.winsGame();
	} else{
		if(seedDiff <= 8){
			if(fgAttemptsDiff <= 225){
				if(seedDiff <= 4){
					if(stlDiff <= 66){
						if(rebPerGameDiff <= -2.385714){
							team1.winsGame();
						} else{
							if(totRebDiff <= -75){
								team2.winsGame();
							} else{
								team1.winsGame();
							}
						}
					} else{
						team2.winsGame();
					}
				} else{
					if(astPerGameDiff <= 1.95){
						if(rebPerGameDiff <= 0.970656){
							team1.winsGame();
						} else{
							team2.winsGame();
						}
					} else{
						team2.winsGame();
					}
				}
			} else{
				team1.winsGame();
			}
		} else{
			team2.winsGame();
		}
	}
}